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Abstract 

Automatic sequence annotation is an essential component of modem 'omics' studies, which aim to extract 
information from large collections of sequence data. Most existing tools use sequence homology to establish 
evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a 
similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the 
correct configuration is critical and can be challenging for non-specialist users. Thus, the development of 
robust automatic annotation techniques that generate high-quality annotations without needing expert 
knowledge would be very valuable for the research community. We present SmaBs, a tool for automatically 
annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s 
is composed of three modules that progressively annotate query sequences using either: (i) very similar 
homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We 
trained the system using several random sets of known sequences, demonstrating average sensitivity and spe- 
cificity values of ^85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide 
varietyof sequence datasets that outperforms the accuracy of other well-established annotation algorithms, 
and it can enrich existing database annotations and uncover previously hidden features. Importantly, SmaBs 
has already been used in the functional annotation of two published transcriptomes. 
Key words: functional annotation; genome annotation; transcriptome annotation; bioinformatic tool 



1 . Introduction 

Sequence annotation isthe process of associating bio- 
logical i nformation to seq uences of i nterest. An notations 
can include the potential function, cellular localization, 
biological process or protein structure of a given se- 
quence.' Some sequences are annotated usingdirect ex- 
perimental evidence, but most annotations are inferred 
from sequence similarities or conserved patterns asso- 
ciated with known characteristics.^"^ Large publically 
accessible databases of annotated sequences make it 
possible to automatically annotate large collections of 



unknown sequences. This is especially valuable for the 
interpretation of large sequence datasets generated by 
genome and expressed sequence tag (EST) sequencing 
projects as well as gene and protein expression experi- 
ments, such as DNA microarrays, and many other emer- 
ging research areas.^ 

Sequence annotation is also important in transcrip- 
tomic experiments that aim to identify gene clusters 
with similar expression patterns that are linked to a par- 
ticular biological process or experimental condition. 
Biological function can then be inferred from annota- 
tions shared within these clusters.^ 
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Several methods have been developed for the anno- 
tation of the large datasets generated by sequencing 
projects. These methods typically perform homology- 
based searches and infer annotations from sequences 
with high levels of homology to the query sequence. 
To detect sequence similarity, most methods use 
BLAST, a fast heuristic algorithm for identifying hom- 
ology between sequences based on local sequence 
alignments.^ 

Blast2GO is one of the most widely used tools for 
genomic annotation.^ It performs three steps: a BLAST 
search, a step that maps similar sequences to existing 
annotation associations, and finally term annotation. 
As its name suggests, Blast2GO mainly uses terms 
from the Gene Ontology (GO) controlled vocabulary 
for annotation,' ° although it also includes other anno- 
tation classes. However, Blast2GO is most commonly 
run online, which can make it more difficult to 
analyse large collections of sequences. Other methods 
can be installed locally, and so avoid problems asso- 
ciated with remote execution, such as sequence 
number limitations or resource thresholds. Blast2GO 
is also available in an offline form, but installation 
requires knowledge of relational databases that make 
it less suitable for routine use and difficult for non-spe- 
cialist users. AutoFact" can perform several types of 
BLAST search using locally downloaded domain and 
motif databases. However, it uses databases with de- 
scriptive non-standard annotation entries rather than 
controlled vocabularies such as GO. The resulting lack 
of homogeneity in terms of vocabulary or syntax can 
make it much more difficult to automatically assign 
annotations or evaluate annotation quality. 

The above methods can be applied to a wide range of 
different sequence types and organisms. Many other 
tools are specialized for specific applications. Some 
tools only deal with single annotation classes, as is the 
case for Gotcha, which only uses GO terms.' ^ Others 
are specialized for the annotation of specific sequence 
types, such as the EST-specific ESTAnnotator' ^ and 
EST-PAC.'"^ In addition, some tools are designed espe- 
cially for the annotation of specific organisms. For 
example, BLANNANOTATOR assigns non-standard 
annotations to bacterial sequences.' ^ 

Some platformsallowdiversedata sources to be inte- 
grated forthe purposesof sequence annotation and an- 
notation analysis. For example, the Babelomics suite,' ^ 
of which Blast2GO is part, contains tools for assigning 
interactions, pathways or even regulatory annotations 
to analysed sequences. Another widely used platform 
is DAVID,' ^ which extracts functional annotations 
from a variety of public genome resources, and allows 
subsequent analysis by biological enrichment. One of 
the richest annotation sources is BioMart,'^ which 
retrieves sequence annotations from EnsembI and 
other databases. 



All of these annotation methods use BLAST to detect 
and evaluate the similarity between a query sequence 
and putative homologues.^ For each query sequence, 
BLAST attempts to create alignments with database 
sequences and uses these alignments to judge se- 
quence similarity. Most annotation methods select 
the database sequence with the highest overall similar- 
ity score for each query sequence as the donor of anno- 
tation information. In addition, a minimum similarity 
threshold sets the limit below which sequences will be 
rejected even if they are the most similar sequence in 
the database. Defining the similarity threshold is not 
an easy task, as it must balance annotation quality 
and sequence coverage (SC). Moreover, it has been 
shown that the optimal similarity threshold can vary 
between different sequence types.' ^ Several other 
BLAST variables, such as how it scores alignment gaps, 
can also result in significantly different results. These 
factors can be difficult to configure, particularly for 
non-specialist users. 

We have developed Sma3s, a fast, accurate and flex- 
ible annotation tool specifically designed forthe anno- 
tation of large collections of sequences obtained from 
diverse gene libraries or coding sequence datasets. It is 
comprised of three modules that solve the annotation 
process with progressively less-stringentsequencesimi- 
larity requirements, combined with methods to opti- 
mize specificity. Each module uses an initial 
exhaustive BLAST search as its starting point. The third 
Sma3s module enhances annotation quality using 
term enrichment to identify annotations shared by 
groups of similar sequences. We have defined optimal 
defaultSma3s parameter values to minimize user inter- 
vention for most applications, thus aiding consistency 
and comparability. We show that Sma3s can rapidly 
produce high levels of prediction accuracy with 
minimal human supervision and modest computation- 
al resources. 



2. Materials and methods 

2.1. Implementation 

Sma3s was written in Perl using modules of the 
BioPerl 1.6 project (http://www.bioperl.org) and is 
freely accessible from the Sma3s website (http://www. 
bioinfocabd.upo.es/sma3s/). Sma3s uses two pro- 
grammes from the Blast package: blastp for amino 
acid sequences and blastx for nucleotide sequences, 
using an £-value threshold of 1 0~^, and blastclust for 
sequence clustering procedures. A Perl biological en- 
richment module, based on hypergeometric distribu- 
tion, was used to calculate probability values (http:// 
www.cse.huji.ac.il/course/2006/bioskill/Ex2/HyGe.pm). 

Sma3s uses the UniProtdatabase in plain-text format 
files (///e.dat), corresponding to the taxonomic division 
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of the organism under study, as its main data source. 
More details about programme requirements and op- 
eration are available on the website. 

Running on a dual-processor Intel Pentium 4 CPU 
3.00 GHz with 4 Gb of RAM, the programme takes an 
average of 0.7 s to annotate each sequence (mean 
length = 400 amino acids) against a division of the 
Swiss-Prot database (~45 000 sequences), using the 
three modules. In fact, the most time-consuming 
process (~90%) is the initial exhaustive BLAST (blastp 
or blastx), whose results are then reused in subsequent 
parameter tuning and annotation refinement steps. 



2.2. Annotation types 

UniProt^° is arguably the most complete publically 
available protein database. Human-based curation 
ensures high-quality annotations, which is particularly 
evident in the Swiss-Prot section, compared with the 
automatically annotated TrEMBL section (also part of 
UniProt). Sma3s uses Swiss-Prot as its main source for 
extracting annotations. The UniProt fields used by 
Sma3s are: 

Gene ontology GO' ° provides a controlled vocabu- 
lary to describe genes and gene product attributes. It 
is organized into three biological ontologies: molecular 
function, biological process and cellular component. 
Standardized GO term annotations are included in 
the cross-reference (DR) field of UniProt. 

Interpro InterPro^^ is an integrated documentation 
resource of protein families, domains and sites. It com- 
bines complementary sequence pattern information 
from several databases. UniProt provides InterPro iden- 
tifiers also from the DR field. 

Swiss-Prot l<eywords The Swiss-Prot l<eywords consti- 
tute a well-defined and controlled vocabulary of 
terms used to annotate a UniProt protein entry. These 
keywords (the KW field in UniProt) describe functions, 
biological processes, structure, cellular localization 
and other protein characteristics. 

Patiiway annotation This annotation provides a de- 
scription of the metabolic pathway(s) in which a 
protein is involved. It is obtained from the comment 
(CC) field containing generic and specific metabolic 
pathway descriptors (e.g.: -!- PATHWAY: Nucleotide 
metabolism; purine metabolism). Sma3s gathers anno- 
tations of the most generic level ('Nucleotide metabol- 
ism' in the previousexample). This type of annotation is 
particularly useful for identifying co-expressed genes 
that are active in the same metabolic pathway. 

New annotation types can be incorporated into 
Sma3s with only minor changes to its algorithms. 



2.3. Thresiiold calculation for the selection of multiple 
homologous sequences 
Module 3 uses a modified version of Sander's 
formula, later updated by Rost^^ to select alignments 
based on both identity and alignment length. The ori- 
ginal formulas describe the relationship between se- 
quence identity and alignment length observed in 
sequences sharing structural similarities as a curve. In 
the Rost equation, the identity threshold (p) for an 
alignment of length L is defined as: 



p'{n) = n + 480 x L 



-0.32x{-\+e-'-''°°) 



(1) 



where n describes the distance in percentage from the 
original curve. This value allows curve adjustment to in- 
crease stringency and to ensure at least 40% identity 
(n = 20) at any alignment length in accordance with 
the ~40% threshold for functional conservation 
described by Wilson et al.^^ and will be the default 
value used for Module 3 annotation. 

2.4. Statistical significance analysis of annotations 

Each term assigned by Sma3s' prediction algorithm is 
assigned a P-value to indicate annotation quality. 
Annotationsfrom Module 3 take P-valuesfrom a biologic- 
al enrichment calculation, while those from Modules 1 
and 2 are derived from the highest rated BLAST hit. 

The biological enrichment algorithm (see below) is 
based on hypergeometric distribution as follows: from 
M sequences in the database, /<" contains annotation X; 
with m sequences in the annotated cluster, the prob- 
ability P(x = k) that k sequences will have annotation 
X\s: 



P{x = k) 



M-K 
m-k 



(2) 



Assuming that /sequences were found with annotation 
X, the probability P that / can be adjusted to the null 
hypothesis is: 



minf/Cm} 

p= 

k=^ 



(3) 



The cut-off P-value was determined experimentally to 
be <0.1 (seethe Results section). 

2.5. Accuracy estimation 

Sma3s assesses three different ratios to evaluate 
method accuracy taking into account test sets of previ- 
ously annotated sequences. These ratios are based on 
TP (true positives: predicted terms corresponding to 
the real annotation), FP (false positives: predicted 
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terms not correspondingtothe real annotation) and FN 
(false negatives: non-predicted real terms). 

It is worth pointingoutthat (TP + FN) corresponds to 
the number of real annotations and (TP+ FP) repre- 
sents the number of predicted annotations. In this 
way, we can calculate both sensitivity and specificity 
indices. We can also calculate the Relative Error 
Quotient (REQ), which evaluates the overall prediction 
quality by considering both the sensitivity and specifi- 
city measurements as described in the GOtcha 
method.' ^ REQ is calculated as: 



REQ = 



FN X W+ FP 

TP X (1 + IV) 



(4) 



where Wis a weighting factor that moderates the influ- 
ence of FN and FP In this work, W has been set to 1 to 
make results comparable with the literature. Low REQ 
values indicate a low annotation error rate, whereas a 
high REQ indicates a higher proportion of errors. This 
measure of accuracy has the advantage that it encom- 
passes TP, FPand FN, thus combining sensitivity and se- 
lectivity in a single value. 

To evaluate prediction coverage percentages, two 
more values were calculated: the SC (the number of 
sequences with predicted annotations with respect to 
sequences with known annotations) and term coverage 
(TC, the number of predicted annotations with respect 
to known annotations). 



2.6. Test datasets 

2.6.1. Random sequence datasets All the databases 
used in this work belong to the UniProt Knowledgebase 
Release 201 1_08, which were obtained from the EBI ftp- 
site: ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_ 
release/knowledgebase/taxonomic_divisions/. 

We downloaded four taxonomic divisions (bacteria, 
plants, invertebrates and mammals) and builtfive inde- 
pendent random datasets of 1 000 sequences (5000 in 
total) for each taxonomic division. These datasets 
were processed with SmaSs. Query sequences were 
removed from the database before each run to avoid 
self-annotation. 

2.6.2. Sequence sets from DNA arrays Weextracted 
two independent DNA arrays from the GEOdatabase:^^ 
Affymetrix Murine 1 1 K SubB Array (AC:GPL76) and 
the Affymetrix Arabidopsis ATHl Genome Array 
(AC:GPL198). The nucleotide sequences from these 
arrays were downloaded, and sequences shorter than 
1 50 were removed (to ensure a minimum amino acid 
sequence length of 50). Reference annotations were 
taken from both Affymetrix-assigned GO terms and 
GO terms extracted from the Swiss-Prot entries referred 
to in the Affymetrix annotation. 



2.7. Blast2GO and Top-BLAST tests 

We opened Blast2GO as a Java application (http:// 
www.blast2go.com/b2ghome), selecting the 2Gb 
memory version, as would be typical for non-expert 
users performing a search with default parameters. 
The test did not include graphical data mining or en- 
richment analysis Blast2GO options. ForTop-BLASTan- 
notation, the best hit from a default BLAST search was 
used to annotate every sequence in the dataset. Self- 
prediction in Blast2GO (annotation based on annota- 
tions from the query sequence) was unavoidable, 
since dataset sequences cannot be removed from the 
database. 



3. Results 

3.1. Sma3s algorithm overview 

With SmaSs, we set out to develop a tool that can in- 
telligently assign annotations to query sequences in a 
way that maximizes accuracy without sacrificing cover- 
age, and works in a way that is robust enough to gener- 
ate high-quality results without having to adjust the 
parameters. 

To achieve these aims, we designed SmaSs as a 
modular system that processes query sequences in a 
series of discrete steps. The method starts with a BLAST 
similarity search, followed by a three-step annotation 
process which sequentially tests different degrees of simi- 
larity between query and database sequences (Fig. 1). 
The first module annotates query sequences using 
database homologues with very high levels of similarity 
(Fig. 1 , Ml ). The second module performs a reciprocal 
BLAST on sequences unannotated in the first step to 
generate annotations from orthologous database 
sequences (Fig. 1 , M2). The third module uses a novel 
strategy (Fig. 1, MS) that looks for more distantly 
related sequences and analyses these data for statistic- 
ally significant sequence annotations. 

Module 1 : The first annotator uses a Top- BLAST strat- 
egy to check whether the query sequence (or one very 
similar to it) already exists in the UniProt database, 
and then directly assigns database annotations to the 
query sequence. We established a minimum sequence 
identity of 90% and a sequence overlap of at least 90% 
of the database sequence, although these parameters 
are customizable. In all cases, SmaSs uses annotations 
from Swiss-Prot in preference to TrEMBLwhen identical 
amino acid sequences are found in both databases 
(Table 1 , see the next section for details). 

Module 2: This module annotates using information 
from putative orthologues with lower similarity. The 
second module of SmaSs is based on reciprocal BLAST 
searches. It detects lower similarity sequences with at 
least 75% of per-residue identity, covering at least 
75% of database sequence length. The best hit is then 
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Figure 1. Sma 3s workflow. The first Ml module derives an notations from highly similar sequences stored in the data base, choosing sequences 
using the Top-BLAST method, which selected the highest similarity homologue from each BLASTsearch that meetsthe minimum similarity 
criteria. The remaining sequences a re passed to the second module (M2), which performs reciprocal BLAST searches to identify orthologous 
sequences as annotation sources, also using the Top-BLAST method. Finally, the M3 module obtains annotations from a set of related 
sequences whose similarity is supported by statistically significant concentrations of similar annotations, filtered by clustering 
techniques to avoid over-representation from duplicated gene families. This figure appears in colour in the online version of DNA Research. 



Table 1. Sma3s results with different source databases 
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0.83 + 0.1 3 
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+ 0.07 


0.1 9 


+ 0.1 5 


0.87 + 0.1 3 


Prot 
















TrEMBL 


0.61 ± 0.1 5 


O.f 


16 


+ 0.08 


0.45 


+ 0.25 


0.94 + 0.06 


UniProt 


0.68 ± 0.09 


O.f 


W 


+ 0.07 


0.32 


+ 0.1 4 


0.94 + 0.05 



Sequence coverage (SC), specificity (Sp), sensitivity (Sn) and 
REQ values are shown together with the corresponding stand- 
ard deviation. 



used as an input seed in a blastp or tbiastn search 
against the initial set of query sequences. If this 
second BLAST identifies the starting query sequence 
as the best hit, then an orthology relationship is estab- 
lished, and consequently, the annotation recovered. 
As in the previous step, SmaSs uses Swiss-Prot annota- 
tions in preference to TrEMBL (Table 1, see the next 
section for details). 

Module 3: Module 3 recovers annotation informa- 
tion from multiple homologous sequences using a 
novel multi-stepapproach. The first step istodetermine 
which homologues have sufficient similarity to the 
query sequence. Work by Sander^^ and Rost^^ found 
that minimum alignment length was a better criterion 
than overall percentage sequence identity for estimat- 
ing sequence and structural homology, especially 
when comparing pairs of structurally matched proteins 
with lower levels of similarity. Intuitively, for short se- 
quence alignments, a high percentage identity is 



needed to establish statistically significant relation- 
ships. Conversely, long pairwise alignments require 
lower identity to be qualified as significant. In the 
context of protein structure, a minimum threshold of 
~20%wasfound tobeagood predictorof protein struc- 
ture, provided thatthe alignment had a length of at least 
1 50 residues. Function is less well conserved than struc- 
ture, but further studies have reported that biological 
function is typically conserved when two sequences 
exhibit 40% sequence identity.^"^ Thus, based on the 
hypothesis that function conservation is also related 
to sequence alignment length, we have used a modified 
form of Rost's equation (see Materials and Methods for 
details) that only selects sequences whose alignments 
have at least 40% identity at any alignment length. 

Although the sequences selected by this method can 
have significantly lower similarities to those identified 
by Modules 1 and 2, the combined analyses of multiple 
sequences can be used to increase annotation reliabil- 
ity. This is based on the hypothesis that annotations 
shared by several homologous sequences are more 
I i kely to reflect f u nctions sha red by the q uery seq uence. 
Module 3 uses biological enrichment to only select 
those terms which appear more frequently in the iden- 
tified homologuesthan would be expected by chance in 
the source database (see 'Statistical significance assess- 
ment of annotations' in the Materials and Methods 
section for details). 

The presence of several redundant sequences in the 
search results could bias the term enrichment 
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algorithm. To reduce this possibility, very similar 
sequences are first combined by 'clustering'. 
Redundant sequences are identified using blastclust 
(part of the Blast package), using 95% identity and 
alignment length as search criteria. Thus, when evaluat- 
ing the frequency of annotation terms, each cluster 
counts only once, regardless of the number of 
sequences it contains. 

In summary. Module 3 processes each query as 
follows (also see Fig. 1 , M3): (i) the BLAST report is 
scanned for statistically significant alignments accord- 
ing to Rost's equation; (ii) identification and grouping 
of clusters containing very similar members of the 
same family; (iii) a preliminary annotation set is 
formed by combining individual non-redundant anno- 
tations from each clusterand finally, (iv) biological term 
enrichment is used to select significantly overrepre- 
sented annotations in each group of homologous 
sequences. Owing to the modular implementation of 
the algorithm, each module can be run independently, 
allowing the user to easily customize annotation criteria. 

3.2. Sma3s annotates random Swiss-Prot sequence 
collections with high accuracy 

We trained Sma3s using random datasets extracted 
from thefourSwiss-Prottaxonomicdivisions. Five inde- 
pendent random sets of 1000 sequences (5000 in 
total) were extracted for each division. These sequences 
were then annotated using the corresponding division 
from Swiss-Prot, and each query sequence removed 
from the database at each step to avoid self-annotation. 

We evaluated the annotation of each random set of 
sequences for SC, specificity, sensitivity and REQ values 
(an accuracy measurement which combines both spe- 
cificity and sensitivity, with 0 being the best possible 
value). The results obtained varied significantly 
between taxonomic divisions, but not between differ- 
ent datasets from the same division (Fig. 2). 

To identify the optimal default cut-off value, we 
tested Sma3s with different P-values using our 
random datasets as query sequences and the corre- 
sponding taxonomic divisions of the Swiss-Prot data- 
base as annotation sources. As can be seen in Fig. 2, 
more restrictive P-values generated consistent results 
with low REQ values and higher specificity (Sp) but at 
the cost of reduced sensitivity (Sn). The highest accur- 
acy, as defined by the lowest REQ values, was obtained 
when specificity and sensitivity had equal or similar 
values. With the exception of the invertebrate dataset, 
a P-value of 0.1 resulted in average sensitivity and spe- 
cificity values >0.8. These values compare favourably 
with those of other tools, such as Blast2GO^ or 
Gotcha,^ ^ whose accuracy typically ranges between 
0.65 and 0.7. On this basis, we established a default 
P-value of 0.1 forSma3s. 



When the same test datasets were annotated using 
the uncurated TrEMBL database or the complete 
UniProt database instead of Swiss-Prot, the average ac- 
curacy was reduced despite the high coverage obtained 
(Table 1). This can be explained by the fact that the 
unreviewed TrEMBL sequence collection is much 
larger than the manually annotated Swiss-Prot set. 
Thus, a large proportion of annotations assigned from 
the UniProt database are derived from TrEMBL, even 
though UniProt also contains Swiss-Prot sequences. 

Accuracy was also calculated separately for each of 
the four annotation types recovered by Sma3s: GO 
terms, SW-Keywords, InterPro and Pathways (Fig. 3). 
Best results were obtained for Pathway and InterPro 
annotations, which had REQ values <0. 2 for all except 
the invertebrate dataset. Keyword and GO annotations 
had poorer REQ values, but still achieved average sensi- 
tivity and specificity values of 0.8 (and as high as 0.97 
for bacteria). The exception, again, was the invertebrate 
dataset for which Sma3s generated higher REQ values, 
associated with significantly lower sensitivity. The rela- 
tively poor results forthe invertebrate dataset are prob- 
ably due to the relatively small number of sequences 
present in the corresponding invertebrate database. 
Sequence length is also important. In fact, this dataset 
is the group with the highest proportion of very short 
sequences (10-40 amino acids), which are usually 
annotated with difficulty (Supplementary Fig. SI a). 

3.3. The M3 module can enrich the annotation of 
previously analysed sequences 

As described above, each dataset sequence is anno- 
tated by one of the three different modules in the 
Sma3s algorithm, which sequentially run from Ml to 
M3. To determine the effectiveness of each module, 
we calculated the percentage of random dataset 
sequences annotated by each module. On average, the 
Ml module annotated 53.76% of sequences (Table 2) 
with M2 and M3 annotating the remainder (19.70 
and 26.54%, respectively). As modules M2 and M3 
only annotate query sequences that were not anno- 
tated by the previous module(s), the ~26% of 
sequences annotated by M3 suggeststhat itsignificant- 
ly increases Sma3s annotation coverage. 

Ml and M2 modules provide highly accurate results 
since annotations are obtained from verified ortholo- 
gues with high degrees of identity between query and 
donor sequences. However, as both modules employ 
the traditional Top-BLAST approach, each query se- 
quence isannotated fromonlyone putative orthologue. 
In contrast, the M3 module typically extracts annota- 
tions from several homologous sequences identified 
in the BLAST report. To bettercompare the relative ben- 
efits of the different approaches used by Sma3s, we 
compared the annotation of our random datasets 
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1 0.1 0.01 0.001 0.00010.00001 lE-IO lE-15 lE-20 1 Q.l 0.01 0.001 0.0001 0.00001 lE-10 lE-lS lE-20 

p-value p-value 




1 0.1 0.01 0.001 0.00010.00001 lE-10 lE-15 lE-20 
p-value 



Figure 2. SmaSs annotation of random Swiss- P rot test datasets. The sequence coverage (SC), specificity (Sp), sensitivity (Sn) and REQ values 
obtained by SmaSs are sliown for (a) bacteria, (b) plants, (c) mammals, (d) invertebrates, (e) and average values, with standard 
deviation for all four datasets. This figure appears in colour in the online version of DNA Research. 



using the three modules as described above (Ml 23) 
against annotation using Module 3 alone (M3). In the 
latter case, all sequences are annotated by Module 3, 
not just the sequences unannotated by Modules 1 
and 2. Figure 4 showsthatthe mean specificity and sen- 
sitivity of the M3 module were slightly lower than the 
full Sma3s analysis, with the REQ value increasing 
from 0.1 9 to 0.32. However, M3 obtained more anno- 
tations (Fig. 4, column TC) due to its ability to extract 
annotations from multiple homologues. M3 accuracy 
was also compared with combined Ml and M2 
modules (Ml 2), which represent more conventional 
annotation algorithms. Ml 2 obtained higher specifi- 
city (0.97) since it annotates only using very similar 
sequences (Fig. 4). However, sensitivity is rather less 
(0.47) with only 64% of dataset sequences annotated. 
This results in an REQ value of 0.71 for Ml 2 versus 



the 0.32 for M3. Several examples illustrate this add- 
itional advantage of Sma3sM3 as a standalone annota- 
tion tool. 

The best BLAST hit of the pigQill protein (QIL1_PIG) 
in the mammalian dataset is the primate homologue 
(QIL1_MACFA). Because this protein entry does not 
contain any of the four annotation types used by 
Sma3s, pig Qill was not annotated by Ml. However, 
the M3 module was able to annotate pig Qill localiza- 
tion using the GO term 'Mitochondrion', obtained from 
a more distant Qill mouse homologue (QILl _MOUSE). 

The Sma3s M3 module could be used even in cases 
where incomplete annotations are present in Swiss- 
Prot. For example, when Ml or M2 Sma3s modules 
are used with invertebrate datasets, the U3-lycotoxin- 
Lsl a protein from the wolf spider (TX308_LYCSI) only 
generates the GQ term 'extracellular region'. In contrast. 
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the MB module annotates the sequence with two add- spider toxin CSTX superfamily, which blocks mamma- 
itional GO terms: 'pathogenesis' and 'calcium channel lian neuronal calcium channels,^^ these terms should 
inhibitor activity'. Since this protein belongs to the be present in the annotation of this protein. 
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Another example is the SfsA protein (SFSA_PSEF5) 
from Pseudomonas fluorescens. SfsA was annotated in 
bacterial datasets with the Swiss-Prot keyword 
'Complete proteome' and no GO terms. However, M3 
predicted the following Keywords and GO terms: DNA 
binding and carbohydrate metabolic process. These 
terms seem valid because the bacterial SfsA family 
includes sugar fermentation stimulation proteins that 
contain a helix-turn-helix motif that probably binds 
DNA at its C-terminus.^^ 

3.4. Accurate annotation of DNA array sequence sets 

To test Sma 3s with an experimental set of sequences, 
we chose collections belonging to two different Affymetrix 
DNA arrays (one mouse and one Arabidopsis). The real 
annotations for these datasets were extracted from the 
Affymetrix web server (March, 2012) as GO terms 
linked to the sequences, or as GO terms extracted 
from Swiss-Prot IDs linked to the sequences. Sma3s 
was used to annotate these arrays using default para- 
meters, and accuracy measured using the reference 
annotations. We compared these results with the trad- 
itional Top-BLAST method, and the widely used 
Blast2GOtool with default annotation settings (Fig. 5). 



Table 2. Number of annotations assigned by each SmaSs module 



Dataset 


Module 


Number of annotations 


% 


Random 


Ml 


468.45 ± 1 64.25 


53.76 




M2 


171.70 + 45.1 2 


19.70 




M3 


231.30 ± 80.72 


26.54 


DNA arrays 


Ml 


386.5 ± 369.82 


5.23 




M2 


462.5 ± 556.49 


6.26 




M3 


6536.5 ± 4531.85 


88.50 



The average number of annotations identified by Sma3s is 
shown for the random and DNA arrays datasets, together 
with the corresponding standard deviation. Note that the 
percentages for each dataset group sum 1 00%, since 
Sma3s has applied its three modules consecutively (default 
configuration). 



Arabidopsis and mouse array analyses were per- 
formed with either the Swiss-Prot (Fig. 5a) or 
Affymetrix (Fig. 5b) annotations as references. In the 
case of Arabidopsis, Sma3s showed the best accuracy 
(as measured by REQ, sensitivity and specificity), al- 
though Top-BLAST provided a better sequence and TC. 
However, there were striking differences between the 
Swiss-Prot and Affymetrix annotation sources. We 
observed a much bigger difference in REQ values 
between Sma3s (Ml 23 and M3 alone) and the other 
methods with Swiss-Prot annotations. 

In the case of the mouse dataset, Sma3s provided sig- 
nificantly better sensitivity albeit with slightly poorer 
specificity than Top-BLAST when using Swiss-Prot as 
an annotation source (Fig. 5c). A similar pattern oc- 
curred when using Affymetrix sources (Fig. 5d), but 
the differences were more marginal. In contrast to the 
Arabidopsis collection, Sma3s generated the highest 
TC values for mouse sequences with both annotation 
sources. Our results suggest that both the full and M3- 
only implementations of Sma3s analysis compare 
favourably with commonly used annotation tools, 
with equal or better REQ values in almost all cases. 
However, it is clear that the annotation results obtained 
were very sensitive to both the query dataset and the 
choice of annotation source. 

3.5. SmaSs is fast and lias low-memory usage 

In the post-genomic era, it is critical to have annota- 
tion tools that can be easily integrated into sequence 
analysis pipelines and applied quickly to large datasets. 
Therefore, it is important that annotation analyses run 
efficiently without excessive resource demands or se- 
quence limitations. To this end, we estimated the 
speed of Sma3s analysis using the modest computer 
hardware (see Materials and Methods for details). We 
tested run times and memory usage using three datasets 
with progressively increasing sizes (Fig. 6). Measurements 
were taken separately for the BLAST search and sequence 
annotation steps. Sma3s processing times increased in 
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Figure 4. Comparison of Sma 3s performance using all modules (Ml 23) or the M3 module alone. Corresponding sensitivity (Sn), specificity 
(Sp), sequence coverage (SC) and term coverage (TC) values appear above the bars. This figure appears in colour in the online version of 
DNA Research. 
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Figure 5. Annotation of two DNA arrays using Sma3s, Blast2GO and Top-BLAST. Tlie annotation prediction results of mouse and Arabidopsis 
sequences are shown from SmaSs with all modules, Sma3s with the M3 module only, Blast2GO and Top-BLAST. The corresponding values 
appear above the bars. Sn, sensitivity; Sp, specificity; SC, sequence coverage; TC, term coverage. This figure appears in colour in the online 
version of DNA Research. 



proportion to the numberof sequencesanalysed (Fig. 6a). 
Times were highly dependent on the initial BLAST step 
which increased in proportion to the sequence number. 
However, SmaSs annotation times were much lower 
and barely affected by increasing dataset sizes. 
Significantly, SmaSs run timeswere two orders of magni- 
tude shorter than those of the Blast2GO method. 

Memory usage was also measured using the same 
datasets and methods (Fig. 6b). SmaSs had an essen- 
tially constant memory usage in its two steps despite 
the sequence number increase, in contrast to the 
Blast2Go method. Our results show that SmaSs is able 
to efficiently annotate large sequence datasets using 
modest computational resources. 



4. Discussion 

Sequencing projects produce hundreds of thousands 
of new biological sequences that need to be quickly and 
accurately annotated since this information constitutes 
the basis for subsequent analysis, such as pathway 
design or genomics and transcriptomics studies. 
Therefore, fast and user friendly methods are needed 
to support sequence annotation at this stage. 

SmaSs uses three consecutive modules: Ml (which 
searches for very similar sequences in the database), 
M2 (which searches for orthologues) and MS (which 
extracts information from several homologous sequences). 
The novel approach used by the MS module selects several 



homologues for each query sequence, using a modified 
Rost equation, which takes into account both alignment 
length and percentage identity. Annotations from all the 
non-redundant sequences are pooled, but only annota- 
tions that are significantly enriched versus the expected 
background frequency are assigned to the query sequence. 
Th is has the adva ntage that usef u I a n notation i nformat ion 
present in several homologues can be extracted, rather 
than only from the sequence with the highest homology. 

Our tests have shown that SmaSs is a fast annotation 
method requiring minimal human supervision and 
computational resources. Mean SmaSs accuracy using 
different annotation types (GO, Swiss-Prot keywords 
and pathways, and InterPro) was higher than 0.8 
(except for the invertebrate dataset), and was even 
higher (0.9) for specific datasets such as bacteria. I n add- 
ition, REO values were consistently very low. InterPro an- 
notation was highly specific and sensitive (see Fig. S), 
which can be explained by the fact that current InterPro 
annotation is strongly focused on sequence similarity, 
which has been shown to be effective for protein function 
prediction 26. On the other hand, it is difficult to judge 
SmaSs Swiss-Prot Pathway annotation given the limited 
coverage of this category in the UniProt database. 
Importantly, both InterPro and Pathway use short and 
controlled vocabularies with low term diversification. 

Some annotation methods can assign sentences 
derived from the database description line 1 5. SmaSs 
can also assign this descriptor (data not shown). 
However,theaccuracyofthese non-standard annotations 
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Figure 6. Run times and memory usage of SmaSs versus Blast2GO. 
Tests were performed using datasets witli different numbers of 
sequences (500, 1 000 and 2000). Tlie different steps in botli 
tools are sliown as separate bars. Blast2GO did not report results 
with datasets >1 000 sequences due to memory problems, (a) 
SmaSs and Blast2GO run times are shown in the first and second 
bars, respectively, (b) SmaSs and Blast2GO memory usage are 
shown in the first and second bars, respectively. SmaSs includes 
Blast and annotation (term assignment to query sequences) 
steps, and Blast2GO additionally includes the mapping step 
(term extraction from the obtained hits). This figure appears in 
colour in the online version of DNA Research. 



is difficult to measure, whereas controlled terms such as 
GO or Keywords are more useful for experiments using 
massive datasets. SmaSs can be easily adapted to use dif- 
ferent annotation classes, and each term predicted by 
SmaSs can be assigned a corresponding probability 
value. For example, SmaSs also includes annotations for 
the most probable gene names for each annotated se- 
quence (data not shown), which can be useful for non- 
specialized users who only want to rapidly identify their 
sequences. Furthermore, SmaSs could be used with 
other annotation sources since the annotation assign- 
ment step can easily be disengaged from the main algo- 
rithm and used with other databases. For example, the 
annotation component could be extended to use data 
from the BioMart project, which links annotations from 
heterogeneous databases to UniProt identifiers. If 
SmaSs was converted to a web service,^^ it could bedirec- 
ted to use BioMart public web services as an annotation 
source. This would also make SmaSs more versatile bydiv- 
iding Blast and annotation steps into separate services. 



4. 7 . The reference dataset and the BLAST results 
influence annotation accuracy 

We demonstrated the accuracy of SmaSs with tests 
using different random datasets (Fig. 2). Moreover, 
SmaSs outperformed other widely used annotation 
methods when annotating DNA array datasets (Fig. 5). 
The only exception was the mouse dataset using Swiss- 
Prot as a reference. The annotation accuracy of this 
dataset is strongly dependent on reference annotations, 
which in this case came from Affymetrix. These annota- 
tions are assigned using NetAffx, which is based on 
homology transfer from heterogeneous databases, in- 
cluding Swiss-Prot, but also Unigene and LocusLink 2 7. 
Furthermore, these annotations depend on the external 
database entries associated with individual Affymetrix 
sequences, which have been shown to contain frequent in- 
accuracies 2 8.Thus,selection of a suitably high-quality ref- 
erence annotation set seems to bean important factorfor 
testingthisoranyother method. Neve rtheless,SmaSs pro- 
duced the best results in virtually all cases. 

Further evidence for the idea that database annota- 
tion influences annotation accuracy is presented in 
Table 1 , where the well-annotated Swiss-Prot division 
of UniProt provided the best accuracy (including speci- 
ficityand sensitivity), while TrEMBL provided the lowest 
values. UniProt isthe most widely used protein database 
since it contains rich information on each sequence. 
UniProt is comprised of Swiss-Prot, a curated section 
reviewed by skilled annotators and TrEMBL, an unre- 
viewed section containing additional sequences that 
can be moved to Swiss-Prot once they have been manu- 
ally reviewed. We have found that the best annotation 
accuracy is obtained when Swiss-Prot is used as the 
sole reference database (Table 1 ). Conversely, accuracy 
was reduced when using either TrEMBL or the complete 
U n i Prot data base, ma i n ly d ue to lower sensitivity. These 
lower values are expected partly because the test data- 
sets were extracted from Swiss-Prot, and partly because 
TrEMBL is a much larger database. Thus, when UniProt 
or TrEMBL are used as reference sources, a higher pro- 
portion of SmaSs annotations are TrEMBL-derived, 
which consequently lowers the average annotation 
accuracy. 

Similarly, it is importantto highlightthat results were 
also dependent on the quality of annotated reference 
sequences for each taxonomic group. For example, 
SmaSs annotation of random invertebrate sequences 
was of a much lower quality than those of bacterial 
sequences. The main difference between the two 
Uniprot datasets is in the number of sequences they 
contain: 24 050 invertebrate sequences versus 
326 570 bacterial sequences. In order for Module 3 
to assign an annotation, it has to be statistically 
enriched within the homologous sequences. The prob- 
ability of finding enriched annotations is more likely 
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with the higher numbers of candidate sequences avail- 
able in the bacterial dataset. In fact, we have found that 
a combination of TrEMBL and SWISS-PROT databases 
increases sensitivity when annotating weakly con- 
served genomes (unpublished observation). However, 
this approach results in a higher rate of false positives. 

Sequence length may bean important factor for suc- 
cessful annotation. For example, short sequences occur 
most frequently in invertebrates and to a lesserextent in 
plants (Supplementary Fig. SI a). These sequences are 
rarely annotated due to the low alignment scores they 
produce. The Ml module, which requires very high 
levels of similarity, is typically the most effective 
Sma3s component for annotation of these sequences 
(Supplementary Fig. SI b). This effect is likely to be 
enhanced by the higher diversity of the invertebrate 
dataset versus the bacterial one. 

Another key factor for successful annotation is the 
initial BLAST search. In order for a sequence to be anno- 
tated by any of the three Sma3s modules, the initial 
BLAST must identify sequence alignments. For example, 
of the 30% of mouse array sequences which were not 
annotated by Sma3s (SC = 0.7 in Fig. 5), 86% lacked 
BLAST results, making annotation impossible. In the 
case of Arabidopsis array sequences, 40% lacked 
BLAST results thus limiting annotation coverage to a 
maximum of 60%. This dependency on BLAST results 
is shared with most annotation methods, including 
AutoFact 1 1 , Blast2GO 9 and GOtcha 1 2. In the case 
of Sma3s, the effect of the BLAST search is especially im- 
portant given that Sma3s was able to annotate 90% of 
array sequences with BLAST alignments, largely due to 
the novel M3 module (Table 2). 

4.2. The combined use of the three Sma3s modules 
generates accurate annotation, with enhanced 
coverage from the M3 module 

We have shown that Sma3s can annotate different 
dataset types using its three module structure. Modules 
Ml and M2 are able to identify closely related database 
sequences and extract specific annotations, albeit with 
low overall coverage (TC in Figs 4 and 5). Thus, 
modules Ml and M2 provide high-quality annotations 
(with lower REQ values), increasing the overall sensitivity 
and specificity of Sma3s (Sn and Sp in Fig. 4). 

Our findings suggest that the M3 module can in- 
crease SC in some cases, especially where close ortholo- 
gues lack database annotation. The initial BLAST result 
can be reused by theM3 module to obtain newannota- 
tionsfor sequences previously annotated byMI orM2. 
M3 annotation is broader, allowing new terms to be 
found (Fig. 4) but with lower specificity, thus increasing 
the chance of generatingfalse positives. For most appli- 
cations, using the three Sma3s modules sequentially 
balance the higher accuracy of Modules 1 and 2 with 



the identification of useful annotations from Module 
3 for otherwise unannotated sequences. 

Our results demonstrate that the Sma3s M3 module 
is able to generate new annotations for sequences that 
were unannotated in theiroriginal databases. Thus,the 
M3 module may be valuable as a database curation 
tool, adding new annotations to existing public data- 
bases. However, it is important to take into account 
that this module is likely to generate annotations with 
lower specificity (Sp in Figs 4 and 5). Ideally, users 
would be able to choose between high sensitivity and 
high specificity according to the goal of the experiment. 

4.3. Sma3s has low computing requirements 

The Sma3s algorithm comprises two different basic 
steps: similarity searching by BLAST and annotation by 
different strategies in each of the three modules. For 
high-throughput analyses, the BLAST itself accounts 
for most of the calculation time. Result execution 
times (Fig. 6) demonstrate that parallel BLAST searches 
could accelerate the process without affecting the rest 
of the algorithm. In fact, both Sma3s and Blast2GO 
can use separately generated BLAST reports, which 
may allow accelerated analysis if the user was to 
execute similarity searches in parallel. However, the 
speed advantage of Sma3s over Blast2GO is due to 
more than the speed of the BLAST search. Another 
reason for the speed of Sma3s is that it uses Swiss- 
Prot, a relatively small high-quality database, which 
itself increases annotation accuracy (Figs 2, 3 and 5). 
These resultstogetherallowforafasterand more accur- 
ate annotation process when analysing large datasets. 
Thus, Sma3s can annotate a set of 20 000 sequences 
in ~3 h, without human intervention. Finally, memory 
usage is low (Fig. 6b), alleviating the need for high per- 
formance computing facilities. 

I nconclusion,Sma3s is a fast and accurate method for 
annotating massive nucleotide or amino acid datasets 
in a way that is readily accessible for non-expert users. 
lnfact,Sma3shasalready been used in thefunctional an- 
notation of both the olive^^ and the maritime pine^° 
transcriptomes, and is now being used for the annota- 
tion of both a new bacterial genome and an ESTdatabase 
of almost 300 plant datasets containing a total of 8.6 
million sequences (data not published yet). 
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